Scalable Column Concept Determination for Web Tables Using Large Knowledge Bases
Abstract
Tabular data on the Web has become a rich source of structured data that is useful for ordinary users to explore. Because of this potential, tables on the Web have recently attracted a number of studies [6, 18] with the goals of understanding the semantics of those Web tables and providing effective search and exploration mechanisms over them. An important part of table understanding and search is column concept determination, i.e., identifying the most appropriate concepts associated with the columns of the tables. The problem becomes especially challenging with the availability of increasingly rich knowledge bases that contain hundreds of millions of entities [10, 31]. In this paper, we focus on an important instantiation of the column concept determination problem, in which the concepts of a column are determined by fuzzy matching its cell values to the entities within a large knowledge base. We provide an efficient MapReduce-based solution that scales with both the number of tables and the size of the knowledge base, and we propose two novel techniques: knowledge concept aggregation and knowledge entity partition. We prove that both the problem of finding the optimal aggregation strategy and that of finding the optimal partition strategy are NP-hard, and propose efficient heuristic techniques that leverage the hierarchy of the knowledge base. Experimental results on real-world datasets show that our method achieves high annotation quality and performance, and scales well.
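To make the problem statement concrete, the following is a minimal in-memory sketch of column concept determination by fuzzy matching. It is not the paper's MapReduce method and does not implement knowledge concept aggregation or knowledge entity partition. The toy knowledge base `TOY_KB`, the helper names `similarity` and `column_concepts`, the 0.8 threshold, and the use of Python's `difflib` ratio as the similarity measure are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical toy knowledge base: entity name -> concepts it belongs to.
TOY_KB = {
    "Barack Obama": {"Politician", "Person"},
    "Angela Merkel": {"Politician", "Person"},
    "Lionel Messi": {"Athlete", "Person"},
    "Berlin": {"City"},
}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def column_concepts(cells, kb, threshold=0.8):
    """Rank concepts by how many cells fuzzily match an entity of that concept."""
    votes = Counter()
    for cell in cells:
        matched = set()
        # Naive scan over every entity; this is the step that does not
        # scale to knowledge bases with hundreds of millions of entities.
        for entity, concepts in kb.items():
            if similarity(cell, entity) >= threshold:
                matched |= concepts
        # Each cell votes at most once per concept.
        votes.update(matched)
    return votes.most_common()


# Column with misspelled cell values.
ranked = column_concepts(["Barak Obama", "Angela Merkel", "Lionel Mesi"], TOY_KB)
print(ranked)
```

The quadratic scan over cells and entities is what becomes impractical at Web scale. Plain vote counting also tends to favor very general concepts such as "Person", so a realistic scoring function would likely need to account for concept specificity.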
Similar resources
Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases
Cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph have gained increasing attention over the last years and are starting to be deployed within various use cases. However, the content of such knowledge bases is far from being complete, far from always being correct, and suffers from deprecation (i.e. population numbers become outdated after some time). Hence, there...
Annotating and Searching Web Tables Using Entities, Types and Relationships
Tables are a universal idiom to present relational data. Billions of tables on Web pages express entity references, attributes and relationships. This representation of relational world knowledge is usually considerably better than completely unstructured, free-format text. At the same time, unlike manually-created knowledge bases, relational information mined from “organic” Web tables need not...
Extracting Knowledge Bases from table-structured Web Resources applied to the semantic based Requirements Engineering Methodology SoftWiki
A lot of information on the Web is provided as HTML formatted tables and CSV files. Such tables contain semantic information that can be derived from the embedded environment of the table as well as from the heading of each column. Often the problem of integrating and linking this information into semantic web applications occurs. One way to solve this is a transformation of these tables into OWL ...
Expansion of Tail Concept Using Web Tables
Human-curated knowledgebases like Freebase and DBPedia cover popular concepts such as persons, organizations and locations, but many more specific concepts fall into the long tail outside current knowledgebases, such as acidic fruits, HD video formats and renewable resources. These concepts are found in concept-entity pairs automatically extracted from text documents, but they cover a limited nu...
Answering Table Queries on the Web using Column Keywords
We present the design of a structured search engine which returns a multi-column table in response to a query consisting of keywords describing each of its columns. We answer such queries by exploiting the millions of tables on the Web because these are much richer sources of structured knowledge than free-format text. However, a corpus of tables harvested from arbitrary HTML web pages presents...
Journal: PVLDB
Volume: 6
Publication date: 2013